Dataset CSV file: nba_logreg.csv
Group No.: 18
Group Members:
VARINDER SINGH - 2021fc04070@wilp.bits-pilani.ac.in
BANDARU RAJA SEKHAR - 2021fc04074@wilp.bits-pilani.ac.in
MIKHIL. P.A. - 2021fc04326@wilp.bits-pilani.ac.in
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np
NBA_dataset = pd.read_csv("nba_logreg.csv")
NBA_dataset.shape
NBA_dataset.info()
NBA_dataset = NBA_dataset.drop_duplicates(subset='Name', keep='last')
NBA_dataset.reset_index(inplace = True, drop = True)
NBA_dataset.shape
print("Number of Duplicates after processing the dataset: ",NBA_dataset.duplicated().sum())
NBA_dataset.head(2)
#Removing Name column
NBA_dataset = NBA_dataset.loc[ :, NBA_dataset.columns != 'Name' ]
The minority class (0) makes up ~37% of the samples, so the class imbalance is mild. A slight imbalance like this is usually not a concern, and the problem can be treated as a standard classification task. Therefore, we will not take any corrective action.
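That proportion can be checked directly with pandas; a minimal sketch on a toy frame standing in for the real target column (the 63/37 split here is illustrative):

```python
import pandas as pd

# Toy stand-in for NBA_dataset's target: 63 ones, 37 zeros (~37% minority)
toy = pd.DataFrame({"TARGET_5Yrs": [1] * 63 + [0] * 37})

proportions = toy["TARGET_5Yrs"].value_counts(normalize=True)
minority_share = proportions.min()
print(proportions)
print("Minority class share: {:.0%}".format(minority_share))
```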
NBA_dataset.shape
plt.figure(figsize=(6, 4))
ax = sns.countplot( x="TARGET_5Yrs", data=NBA_dataset )
for p in ax.patches:
    ax.annotate('{:}'.format(p.get_height()), (p.get_x() + 0.35, p.get_height() + 0.05))
from pandas.plotting import scatter_matrix
scatter_matrix(NBA_dataset, figsize=(50, 50))
plt.show()
corr = NBA_dataset.corr()
plt.figure(figsize=(20, 20))
mask = np.triu( np.ones_like(corr) )
hm = sns.heatmap( corr, mask=mask, vmin=-1, vmax=1, annot=True, cmap='Spectral' )
hm.set_title('Correlation Heatmap', fontdict={'fontsize':18}, pad=12)
The following pairs have a correlation greater than 0.9.
Highly correlated features are close to linearly dependent and therefore carry almost the same information about the dependent variable. When two features are strongly correlated, one of the two can be dropped. Based on the heatmap, we will remove the following columns:
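The drop list was chosen by inspecting the heatmap, but the same pairs can be found programmatically. A sketch on toy data (the column names here are illustrative, not the NBA ones) that lists every pair with correlation above 0.9:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
a = rng.normal(size=200)
# 'total' is nearly a copy of 'a', so |r| > 0.9; 'other' is independent noise
toy = pd.DataFrame({
    "a": a,
    "total": a + rng.normal(scale=0.1, size=200),
    "other": rng.normal(size=200),
})

corr = toy.corr().abs()
# Keep only the strict upper triangle so each pair is reported once
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
high_pairs = [(r, c) for r in upper.index for c in upper.columns
              if pd.notna(upper.loc[r, c]) and upper.loc[r, c] > 0.9]
print(high_pairs)
```

Running the same loop over `NBA_dataset.corr()` would reproduce the pairs read off the heatmap.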
NBA_dataset = NBA_dataset.drop(['MIN','PTS','REB','3P Made','FGM','FTM'], axis = 1)
NBA_dataset
#Dataset Information before Data pre-processing
NBA_dataset.info()
Replacing the NULL values in the columns of NBA_dataset with the mean (average) of the respective column.
The columns containing NULL values are as follows:
# NaN values being replaced by the mean value of the column
NBA_dataset['GP'] = NBA_dataset['GP'].fillna(NBA_dataset['GP'].mean())
NBA_dataset['3P%'] = NBA_dataset['3P%'].fillna(NBA_dataset['3P%'].mean())
NBA_dataset['FT%'] = NBA_dataset['FT%'].fillna(NBA_dataset['FT%'].mean())
NBA_dataset['OREB'] = NBA_dataset['OREB'].fillna(NBA_dataset['OREB'].mean())
NBA_dataset['AST'] = NBA_dataset['AST'].fillna(NBA_dataset['AST'].mean())
NBA_dataset['STL'] = NBA_dataset['STL'].fillna(NBA_dataset['STL'].mean())
#Dataset Information after Data pre-processing
NBA_dataset.info()
# Columns of dataset
col = ['GP','FGA','FG%','3PA','3P%','FTA','FT%','OREB','DREB','AST','STL','BLK','TOV','TARGET_5Yrs']
import warnings
warnings.filterwarnings('ignore')
plt.figure(figsize=(30,30))
num = 1
for i in col:
    plt.subplot(10, 4, num)
    sns.histplot(NBA_dataset[i], kde=True)  # distplot is deprecated in recent seaborn
    num = num + 1
    plt.subplot(10, 4, num)
    sns.boxplot(x=NBA_dataset[i])
    num = num + 1
NBA_dataset.boxplot(figsize = (20,10), column = col)
Winsorization is the process of replacing the extreme values in statistical data in order to limit the effect of outliers on subsequent calculations or results.
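On a toy array, IQR-based capping looks like this; `np.clip` gives the same result as the nested `np.where` used in the loop below:

```python
import numpy as np

values = np.array([1.0, 2.0, 2.5, 3.0, 3.5, 4.0, 100.0])  # 100.0 is an outlier

q1, q3 = np.percentile(values, [25, 75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
capped = np.clip(values, lower, upper)  # anything outside the fences is pulled onto them
print(capped)
```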
for i in col:
    percentile25 = NBA_dataset[i].quantile(0.25)
    percentile75 = NBA_dataset[i].quantile(0.75)
    iqr = percentile75 - percentile25
    upper_limit = percentile75 + 1.5 * iqr
    lower_limit = percentile25 - 1.5 * iqr
    NBA_dataset[i] = np.where(NBA_dataset[i] >= upper_limit, upper_limit,
                     np.where(NBA_dataset[i] <= lower_limit, lower_limit,
                              NBA_dataset[i]))
NBA_dataset.boxplot(figsize = (20,10), column = col)
StandardScaler produces a standardized distribution with zero mean and unit variance: it subtracts the feature mean from each value and divides the result by the feature's standard deviation.
MinMaxScaler rescales every feature into a given range, [0, 1] by default.
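A quick toy comparison of the two scalers on a single illustrative feature:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler, MinMaxScaler

X = np.array([[1.0], [2.0], [3.0], [4.0]])

z = StandardScaler().fit_transform(X)   # zero mean, unit variance
m = MinMaxScaler().fit_transform(X)     # squeezed into [0, 1]

print("standardised mean/std:", z.mean(), z.std())
print("min-max range:", m.min(), m.max())
```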
#Standardising the data
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
NBA_dataset_standardised = sc.fit_transform(NBA_dataset.iloc[: ,:-1])
NBA_dataset_standardised
#Adding the column name to standardised dataset
NBA_dataset_standardised = pd.DataFrame(NBA_dataset_standardised,
columns = ['GP','FGA','FG%','3PA',
'3P%','FTA','FT%','OREB','DREB','AST',
'STL','BLK','TOV'])
#Resetting the index
NBA_dataset_standardised = NBA_dataset_standardised.reset_index(drop = True)
NBA_dataset['TARGET_5Yrs'] = NBA_dataset['TARGET_5Yrs'].reset_index(drop = True)
#Adding target column to standardised dataset
NBA_dataset_standardised['TARGET_5Yrs'] = NBA_dataset['TARGET_5Yrs']
NBA_dataset_standardised
X = NBA_dataset_standardised.iloc[:, :-1].values
Y = NBA_dataset_standardised.iloc[:, -1].values
# logistic regression for feature importance
from sklearn.linear_model import LogisticRegression
# define the model
model = LogisticRegression()
# fit the model
model.fit(X, Y)
# get importance
importance = model.coef_[0]
df = pd.DataFrame()
df["Column"] = NBA_dataset_standardised.columns[:-1]
df["Coeff. Value"] = importance
print(df)
#Plotting feature importance
plt.figure(figsize=(15,5))
plt.bar([x for x in range(len(importance))], importance)
plt.show()
The coefficients can provide the basis for a crude feature importance score. This assumes that the input variables have the same scale or have been scaled prior to fitting a model. Recall that this is a classification problem with classes 0 and 1. Coefficients are both positive and negative. The positive scores indicate a feature that predicts class 1, whereas the negative scores indicate a feature that predicts class 0.
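On standardized inputs the coefficients are directly comparable, and exponentiating them gives odds ratios. A small sketch on synthetic data (the 2-feature setup and variable names are illustrative):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(42)
X_toy = rng.normal(size=(300, 2))
# The label depends only on feature 0; feature 1 is irrelevant
y_toy = (X_toy[:, 0] + 0.1 * rng.normal(size=300) > 0).astype(int)

clf = LogisticRegression().fit(X_toy, y_toy)
coefs = clf.coef_[0]
odds_ratios = np.exp(coefs)  # multiplicative change in the odds per unit increase
print("coefficients:", coefs)
print("odds ratios :", odds_ratios)
```

The informative feature gets a large positive coefficient (odds ratio well above 1), while the irrelevant one stays near zero (odds ratio near 1).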
Case 1 : Train = 80%, Test = 20% -> [x_train1, y_train1] = 80% ; [x_test1, y_test1] = 20%
Case 2 : Train = 10%, Test = 90% -> [x_train2, y_train2] = 10% ; [x_test2, y_test2] = 90%
from sklearn.model_selection import train_test_split
#Case 1 with test size 20%
X_train_r_case1, X_test_r_case1, Y_train_r_case1, Y_test_r_case1 = train_test_split(X, Y, test_size = 0.2, random_state = 0)
X_train_case1 = np.c_[np.ones((X_train_r_case1.shape[0], 1)), X_train_r_case1]
X_test_case1 = np.c_[np.ones((X_test_r_case1.shape[0], 1)), X_test_r_case1]
Y_train_case1 = Y_train_r_case1[:, np.newaxis]
Y_test_case1 = Y_test_r_case1[:, np.newaxis]
#Case 2 with test size 90%
X_train_r_case2, X_test_r_case2, Y_train_r_case2, Y_test_r_case2 = train_test_split(X, Y, test_size = 0.9, random_state = 0)
X_train_case2 = np.c_[np.ones((X_train_r_case2.shape[0], 1)), X_train_r_case2]
X_test_case2 = np.c_[np.ones((X_test_r_case2.shape[0], 1)), X_test_r_case2]
Y_train_case2 = Y_train_r_case2[:, np.newaxis]
Y_test_case2 = Y_test_r_case2[:, np.newaxis]
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression
k_fold_score = cross_val_score(LogisticRegression(), X, Y, cv=5)
print("K-Fold Cross Validation score with k = 5: \n\
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~\n")
print("Individual Accuracies: \n", [i for i in k_fold_score] ,"\n")
print("Mean Accuracy: {:.2f}% \n".format(k_fold_score.mean()*100))
print("Standard Deviation: {:.2f}% \n".format(k_fold_score.std()*100))
k_fold_score = cross_val_score(LogisticRegression(), X, Y, cv=10)
print("K-Fold Cross Validation score with k = 10: \n\
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~\n")
print("Individual Accuracies: \n", [i for i in k_fold_score] ,"\n")
print("Mean Accuracy: {:.2f}% \n".format(k_fold_score.mean()*100))
print("Standard Deviation: {:.2f}% \n".format(k_fold_score.std()*100))
#Initialising Theta
Theta_train_case1 = np.zeros((X_train_case1.shape[1], 1))
Theta_test_case1 = np.zeros((X_test_case1.shape[1], 1))
Theta_train_case2 = np.zeros((X_train_case2.shape[1], 1))
Theta_test_case2 = np.zeros((X_test_case2.shape[1], 1))
def sigmoid(a):
    return 1.0 / (1 + np.exp(-a))

def cost(x, y, theta):
    m = x.shape[0]
    h = sigmoid(np.matmul(x, theta))
    cost = (np.matmul(-y.T, np.log(h)) - np.matmul((1 - y.T), np.log(1 - h))) / m
    return cost

def gradient_descent(theta, learning_rate, x, y):
    m = x.shape[0]
    h = sigmoid(np.matmul(x, theta))
    grad = np.matmul(x.T, (h - y)) / m
    # Standard update: step along the negative gradient.
    # (Scaling the step by the cost J is not part of gradient descent.)
    theta = theta - learning_rate * grad
    return theta
def training_model_LR_GD(Theta, X_train, Y_train):
    n_iterations = 10000
    learning_rate = 0.05  # hyperparameter - fixed by trial and error
    # to store the cost values
    cost_history = []
    # (Re)initialise Theta to match the design matrix actually passed in
    Theta = np.zeros((X_train.shape[1], 1))
    for i in range(n_iterations + 1):
        Theta1 = gradient_descent(Theta, learning_rate, X_train, Y_train)
        J_new = cost(X_train, Y_train, Theta1)
        cost_history.append(J_new.flatten())
        Theta = Theta1
        # if i % 100 == 0:
        #     print('epoch = {}, cost = {}'.format(i, J_new))
    print('Training completed')
    return cost_history, Theta1
cost_history_train_case1, Theta_train_case1 = training_model_LR_GD(Theta_train_case1, X_train_case1, Y_train_case1)
cost_history_test_case1, Theta_test_case1 = training_model_LR_GD(Theta_test_case1, X_test_case1, Y_test_case1)
cost_history_train_case2, Theta_train_case2 = training_model_LR_GD(Theta_train_case2, X_train_case2, Y_train_case2)
cost_history_test_case2, Theta_test_case2 = training_model_LR_GD(Theta_test_case2, X_test_case2, Y_test_case2)
# Case-1 train prediction data
h = sigmoid(np.matmul( X_train_case1, Theta_train_case1 ))
Y_pred_train_case1 = (h > .5).astype(int)
# Case-1 test prediction data
h = sigmoid(np.matmul( X_test_case1, Theta_test_case1 ))
Y_pred_test_case1 = (h > .5).astype(int)
# Case-2 train prediction data
h = sigmoid(np.matmul( X_train_case2, Theta_train_case2 ))
Y_pred_train_case2 = (h > .5).astype(int)
# Case-2 test prediction data
h = sigmoid(np.matmul( X_test_case2, Theta_test_case2 ))
Y_pred_test_case2 = (h > .5).astype(int)
The L1 loss function minimizes the sum of the absolute differences between the true values and the predicted values.
The L2 loss function minimizes the sum of the squared differences between the true values and the predicted values.
Neither the L1 nor the L2 loss is used as the cost function of logistic regression: composing them with the sigmoid yields a non-convex function of the weights, and optimizers such as gradient descent are only guaranteed to converge to the global minimum on convex functions.
The cost function used instead is:
J = -y log( h(x) ) - ( 1 - y ) log( 1 - h(x) )
where y is the true target value and h(x) = sigmoid( wx + b ).
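A numeric check of this cost on toy predictions (the helper name `cross_entropy` and the probabilities are ours):

```python
import numpy as np

def cross_entropy(y, h):
    # J = -y*log(h) - (1 - y)*log(1 - h), averaged over the samples
    return float(np.mean(-y * np.log(h) - (1 - y) * np.log(1 - h)))

y_true = np.array([1.0, 0.0, 1.0])
h_good = np.array([0.9, 0.1, 0.8])  # confident and mostly correct
h_bad = np.array([0.1, 0.9, 0.2])   # confident and wrong

print(cross_entropy(y_true, h_good))  # small loss
print(cross_entropy(y_true, h_bad))   # large loss: confident mistakes are punished hard
```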
plt.figure(figsize=(10,8))
plt.suptitle('CASE 1 - Cost/Loss Function of Logistic Regression using Gradient Descent', fontsize=16)
plt.subplot(2,2,1)
plt.plot(cost_history_train_case1)
plt.title("Training Set")
plt.xlabel("Number of iterations")
plt.ylabel("Cost")
plt.subplot(2,2,2)
plt.plot(cost_history_test_case1)
plt.title("Testing Set")
plt.xlabel("Number of iterations")
plt.ylabel("Cost")
plt.show()
plt.figure(figsize=(10,8))
plt.suptitle('CASE 2 - Cost/Loss Function of Logistic Regression using Gradient Descent', fontsize=16)
plt.subplot(2,2,1)
plt.plot(cost_history_train_case2)
plt.title("Training Set")
plt.xlabel("Number of iterations")
plt.ylabel("Cost")
plt.subplot(2,2,2)
plt.plot(cost_history_test_case2)
plt.title("Testing Set")
plt.xlabel("Number of iterations")
plt.ylabel("Cost")
plt.show()
from sklearn.metrics import accuracy_score, roc_auc_score
print(" ============== CASE 1 ==============")
print(" ====== Train = 80% & Test = 20% ======")
print(" ======================================\n")
print("Accuracy Score of Train data: {0:.2f}%".format(accuracy_score(Y_train_case1, Y_pred_train_case1)*100))
print("Accuracy Score of Test data: {0:.2f}%".format(accuracy_score(Y_test_case1, Y_pred_test_case1)*100))
print("\n\n ============== CASE 2 ==============")
print(" ====== Train = 10% & Test = 90% ======")
print(" ======================================\n")
print("Accuracy Score of Train data: {0:.2f}%".format(accuracy_score(Y_train_case2, Y_pred_train_case2)*100))
print("Accuracy Score of Test data: {0:.2f}%".format(accuracy_score(Y_test_case2, Y_pred_test_case2)*100))
There are three types of gradient descent learning algorithms: batch gradient descent, stochastic gradient descent and mini-batch gradient descent.
Batch gradient descent
Batch gradient descent sums the error for each point in a training set, updating the model only after all training examples have been evaluated. This process is referred to as a training epoch.
While this batching provides computation efficiency, it can still have a long processing time for large training datasets as it still needs to store all of the data into memory. Batch gradient descent also usually produces a stable error gradient and convergence, but sometimes that convergence point isn’t the most ideal, finding the local minimum versus the global one.
Stochastic gradient descent
Stochastic gradient descent (SGD) updates the parameters for each training example, one example at a time. Since only one example needs to be held in memory at once, the storage requirements are minimal. These frequent updates can offer more detail and speed, but they lose some computational efficiency compared to batch gradient descent. The frequent updates also produce noisy gradients, which can actually help in escaping a local minimum and finding the global one.
Mini-batch gradient descent
Mini-batch gradient descent combines concepts from both batch gradient descent and stochastic gradient descent. It splits the training dataset into small batch sizes and performs updates on each of those batches. This approach strikes a balance between the computational efficiency of batch gradient descent and the speed of stochastic gradient descent.
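The three variants differ only in how many examples feed each parameter update. A minimal mini-batch sketch for the logistic cost on synthetic data (the batch size 32 and learning rate 0.5 are arbitrary choices, not tuned values):

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

rng = np.random.default_rng(0)
X_mb = np.c_[np.ones(200), rng.normal(size=(200, 2))]  # bias column + 2 features
theta_true = np.array([[0.5], [2.0], [-1.0]])
y_mb = (sigmoid(X_mb @ theta_true) > 0.5).astype(float)

theta = np.zeros((3, 1))
batch_size, lr = 32, 0.5
for epoch in range(200):
    order = rng.permutation(len(X_mb))            # reshuffle every epoch
    for start in range(0, len(X_mb), batch_size):
        b = order[start:start + batch_size]       # one mini-batch of indices
        h = sigmoid(X_mb[b] @ theta)
        theta -= lr * X_mb[b].T @ (h - y_mb[b]) / len(b)

acc = float(((sigmoid(X_mb @ theta) > 0.5) == y_mb).mean())
print("training accuracy:", acc)
```

Setting `batch_size = len(X_mb)` recovers batch gradient descent, and `batch_size = 1` recovers SGD.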
Logistic regression is a model for binary classification predictive modeling. The parameters of a logistic regression model can be estimated by the probabilistic framework called maximum likelihood estimation. Under this framework, a probability distribution for the target variable (class label) must be assumed and then a likelihood function defined that calculates the probability of observing the outcome given the input data and the model. This function can then be optimized to find the set of parameters that results in the largest sum likelihood over the training dataset.
The maximum likelihood approach to fitting a logistic regression model both aids in better understanding the form of the model and provides a template for fitting classification models more generally, particularly because the negative log-likelihood used in the procedure can be shown to be equivalent to the cross-entropy loss function.
Maximum Likelihood Estimation is a frequentist probabilistic framework that seeks the set of model parameters that maximizes a likelihood function. In Maximum Likelihood Estimation, we wish to maximize the conditional probability of observing the data (X) given a specific probability distribution and its parameters (theta), stated formally as:
P(X ; theta)
Where X is, in fact, the joint probability distribution of all observations from the problem domain from 1 to n.
P(x1, x2, x3, …, xn ; theta)
This resulting conditional probability is referred to as the likelihood of observing the data given the model parameters and written using the notation L() to denote the likelihood function. For example:
L(X ; theta)
The joint probability distribution can be restated as the multiplication of the conditional probability for observing each example given the distribution parameters. Multiplying many small probabilities together can be unstable; as such, it is common to restate this problem as the sum of the log conditional probability.
sum i to n log(P(xi ; theta))
Given the frequent use of log in the likelihood function, it is referred to as a log-likelihood function. It is common in optimization problems to prefer to minimize the cost function rather than to maximize it. Therefore, the negative of the log-likelihood function is used, referred to generally as a Negative Log-Likelihood (NLL) function.
minimize -sum i to n log(P(xi ; theta))
The Maximum Likelihood Estimation framework can be used as a basis for estimating the parameters of many different machine learning models for regression and classification predictive modeling. This includes the logistic regression model.
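A tiny numeric illustration of why the log transform is preferred (the five probabilities are made up):

```python
import numpy as np

# Probability the fitted model assigns to each observed outcome
p_obs = np.array([0.9, 0.8, 0.95, 0.7, 0.85])

likelihood = float(np.prod(p_obs))             # product of many small numbers
log_likelihood = float(np.sum(np.log(p_obs)))  # numerically stabler sum of logs
nll = -log_likelihood                          # the quantity we minimize

print(likelihood, log_likelihood, nll)
```

With thousands of examples the raw product underflows to 0.0 in floating point, while the sum of logs stays perfectly representable; that is why the NLL form is used in practice.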
Regularization is a technique used to reduce error by fitting the function appropriately on the training set while avoiding overfitting. Sometimes a model performs well on the training data but poorly on the test data: it cannot generalize to unseen data, and such a model is said to be overfitted. Regularization addresses this problem.
There are mainly two types of regularization techniques, which are given below:
Ridge Regression
Ridge regression, also called L2 regularization, is a regularization technique used to reduce the complexity of the model. The cost function is altered by adding a penalty term proportional to the square of the weights; the amount of bias this adds to the model is called the Ridge Regression penalty.
Lasso Regression
Lasso regression, also called L1 regularization, is similar to Ridge regression except that the penalty term contains the absolute values of the weights instead of their squares. Because it uses absolute values, it can shrink a coefficient exactly to 0, whereas Ridge regression can only shrink it close to 0.
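This zeroing behaviour is easy to demonstrate on synthetic data; a sketch using scikit-learn's `Lasso` and `Ridge` regressors (alpha = 0.5 is an arbitrary choice):

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(0)
X_r = rng.normal(size=(100, 5))
# Only the first two features matter; the last three are pure noise
y_r = X_r @ np.array([3.0, -2.0, 0.0, 0.0, 0.0]) + 0.1 * rng.normal(size=100)

lasso = Lasso(alpha=0.5).fit(X_r, y_r)
ridge = Ridge(alpha=0.5).fit(X_r, y_r)
print("Lasso:", lasso.coef_)  # noise coefficients driven exactly to 0
print("Ridge:", ridge.coef_)  # noise coefficients shrunk, but not exactly 0
```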
from sklearn.model_selection import cross_val_score
#defining a generic Function to give ROC_AUC Scores in table format for better readability
def crossvalscore(model, X, y):
    scores = cross_val_score(model, X, y, cv=5, scoring='roc_auc', n_jobs=-1)
    acc = cross_val_score(model, X, y, cv=5, scoring='accuracy', n_jobs=-1)
    rand_scores = pd.DataFrame({
        'cv': range(1, 6),
        'roc_auc score': scores,
        'accuracy score': acc
    })
    AUC = rand_scores['roc_auc score'].mean()
    ACC = rand_scores['accuracy score'].mean()
    print('AUC : ', AUC)
    print('Accuracy Score: ', ACC)
    return AUC, ACC
from sklearn.linear_model import LogisticRegression
print("+++++++++++++++++++++++++\n L1 Regularisation\n+++++++++++++++++++++++++\n")
#Case 1
print("--- CASE 1 ---")
log_clf = LogisticRegression(C = 0.1, class_weight= 'balanced', penalty= 'l1', solver= 'liblinear',random_state=42)
L1_reg_auc_case1, L1_reg_acc_case1 = crossvalscore(log_clf, X_train_r_case1, Y_train_r_case1)
#Case 2
print("\n--- CASE 2 ---")
log_clf = LogisticRegression(C = 0.1, class_weight= 'balanced', penalty= 'l1', solver= 'liblinear',random_state=42)
L1_reg_auc_case2, L1_reg_acc_case2 = crossvalscore(log_clf, X_train_r_case2, Y_train_r_case2)
print("\n+++++++++++++++++++++++++\n L2 Regularisation\n+++++++++++++++++++++++++\n")
#Case 1
print("--- CASE 1 ---")
log_clf = LogisticRegression(C = 0.1, class_weight= 'balanced', penalty= 'l2', solver= 'liblinear',random_state=42)
L2_reg_auc_case1, L2_reg_acc_case1 = crossvalscore(log_clf, X_train_r_case1, Y_train_r_case1)
#Case 2
print("\n--- CASE 2 ---")
log_clf = LogisticRegression(C = 0.1, class_weight= 'balanced', penalty= 'l2', solver= 'liblinear',random_state=42)
L2_reg_auc_case2, L2_reg_acc_case2 = crossvalscore(log_clf, X_train_r_case2, Y_train_r_case2)
#Getting Scores of Accuracy and AUC for WITHOUT Regularisation
WO_Reg_acc_case1 = accuracy_score(Y_train_case1, Y_pred_train_case1)
WO_Reg_acc_case2 = accuracy_score(Y_train_case2, Y_pred_train_case2)
WO_Reg_auc_case1 = roc_auc_score(Y_train_case1, Y_pred_train_case1)
WO_Reg_auc_case2 = roc_auc_score(Y_train_case2, Y_pred_train_case2)
warnings.filterwarnings('ignore')
!pip install tabulate
from tabulate import tabulate
data1 = [
["Without Regularization", WO_Reg_acc_case1*100, WO_Reg_acc_case2*100],
["L1 Regularization", L1_reg_acc_case1*100, L1_reg_acc_case2*100],
["L2 Regularization", L2_reg_acc_case1*100, L2_reg_acc_case2*100]
]
data2 = [
["Without Regularization", WO_Reg_auc_case1*100, WO_Reg_auc_case2*100],
["L1 Regularization", L1_reg_auc_case1*100, L1_reg_auc_case2*100],
["L2 Regularization", L2_reg_auc_case1*100, L2_reg_auc_case2*100]
]
print("++++++++++++++++++++++++++++++++++++++++++++++\n Accuracy Score\n++++++++++++++++++++++++++++++++++++++++++++++")
print(tabulate(data1 , headers=["", "Case 1", "Case 2"]))
print("\n\n++++++++++++++++++++++++++++++++++++++++++++++\n AUC\n++++++++++++++++++++++++++++++++++++++++++++++")
print(tabulate(data2 , headers=["", "Case 1", "Case 2"]))
Regularization did not improve the accuracy score: the accuracy without regularization is higher than with L1 or L2 regularization. The AUC, however, is higher with L1/L2 regularization than without.
The higher the AUC, the better the model is at predicting class 0 as 0 and class 1 as 1. Since the AUC for Case 2 is higher than for Case 1, the Case 2 model appears better suited for prediction.
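A toy illustration of how AUC rewards correct ranking (four hand-picked scores):

```python
import numpy as np
from sklearn.metrics import roc_auc_score

y_true = np.array([0, 0, 1, 1])
good_scores = np.array([0.1, 0.2, 0.8, 0.9])  # every positive outranks every negative
poor_scores = np.array([0.5, 0.9, 0.2, 0.6])  # only 1 of the 4 pos/neg pairs ranked correctly

print(roc_auc_score(y_true, good_scores))  # 1.0: perfect ranking
print(roc_auc_score(y_true, poor_scores))  # 0.25: 1 concordant pair out of 4
```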
from sklearn.metrics import confusion_matrix, accuracy_score, precision_score, recall_score, f1_score
def accuracy(y_pred, y_test):
    cm = confusion_matrix(y_test, y_pred)
    print("Confusion Matrix : \n", cm, '\n')
    print("Accuracy Score : {0:.2f}%".format(accuracy_score(y_test, y_pred)*100))
    print("Precision : {0:.2f}".format(precision_score(y_test, y_pred)))
    print("Recall : {0:.2f}".format(recall_score(y_test, y_pred)))
    print("F1 Score : {0:.2f}".format(f1_score(y_test, y_pred)))
print("=============================================================\n \
CASE - 1 | Prediction Evaluation metrics\n\
=============================================================\n")
print("++++++++++++++++++\nTest Set Accuracy: \n++++++++++++++++++")
accuracy(Y_pred_test_case1, Y_test_case1)
print("\n\n++++++++++++++++++\nTrain Set Accuracy: \n++++++++++++++++++")
accuracy(Y_pred_train_case1, Y_train_case1)
print("\n =============================================================\n \
CASE - 2 | Prediction Evaluation metrics \n \
=============================================================\n")
print("++++++++++++++++++\nTest Set Accuracy: \n++++++++++++++++++")
accuracy(Y_pred_test_case2, Y_test_case2)
print("\n\n++++++++++++++++++\nTrain Set Accuracy: \n++++++++++++++++++")
accuracy(Y_pred_train_case2, Y_train_case2)
Case 1 (Train = 80%, Test = 20%)
Training accuracy is 71.79% and test accuracy is 75.68%. Because the model does not fit even the training data particularly well, the Case 1 model appears to be underfit.
Case 2 (Train = 10%, Test = 90%)
Training accuracy is 79.07% and test accuracy is 72.70%. The gap between training and test accuracy suggests the Case 2 model is overfit.